
    Conditional Dependencies: A Principled Approach to Improving Data Quality

    Abstract. Real-life data is often dirty and costs businesses billions of pounds worldwide each year. This paper presents a promising approach to improving data quality. It effectively detects and fixes inconsistencies in real-life data based on conditional dependencies, an extension of database dependencies that enforces bindings of semantically related data values. It accurately identifies records from unreliable data sources by leveraging relative candidate keys, an extension of keys for relations that supports similarity and matching operators across relations. In contrast to traditional dependencies, which were developed to improve the quality of schemas, the revised constraints are proposed to improve the quality of data. These constraints yield practical techniques for data repairing and record matching in a uniform framework.
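    To make the idea concrete, here is a minimal sketch (not the paper's implementation) of checking one conditional functional dependency. The rule and records are hypothetical: for tuples where country is "UK", the same zip must always map to the same city, i.e. zip determines city conditioned on country.

```python
# Sketch of detecting violations of a conditional functional dependency (CFD).
# Hypothetical rule: within country == "UK", zip -> city must hold.

def find_cfd_violations(records, condition, lhs, rhs):
    """Return pairs of records violating lhs -> rhs among records matching condition."""
    seen = {}        # lhs value -> (first rhs value observed, the record it came from)
    violations = []
    for rec in records:
        if not condition(rec):
            continue                      # rule only binds records meeting the condition
        key = rec[lhs]
        if key in seen and seen[key][0] != rec[rhs]:
            violations.append((seen[key][1], rec))
        elif key not in seen:
            seen[key] = (rec[rhs], rec)
    return violations

records = [
    {"country": "UK", "zip": "EH8", "city": "Edinburgh"},
    {"country": "UK", "zip": "EH8", "city": "London"},   # inconsistent with the rule
    {"country": "US", "zip": "EH8", "city": "Boston"},   # condition not met, ignored
]
bad = find_cfd_violations(records, lambda r: r["country"] == "UK", "zip", "city")
```

    A repair step would then resolve each violating pair, e.g. by trusting the more reliable source; the paper's framework handles repair and matching together, which this sketch does not attempt.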

    Multi-source statistics: Basic situations and methods

    Many National Statistical Institutes (NSIs), especially in Europe, are moving from single‐source statistics to multi‐source statistics. By combining data sources, NSIs can produce more detailed and more timely statistics and respond more quickly to events in society. By combining survey data with already available administrative data and Big Data, NSIs can save data collection and processing costs and reduce the burden on respondents. However, multi‐source statistics come with new problems that need to be overcome before the resulting output quality is sufficiently high and before those statistics can be produced efficiently. What complicates the production of multi‐source statistics is that they come in many different varieties, as data sets can be combined in many different ways. Given the rapidly increasing importance of producing multi‐source statistics in Official Statistics, there has been considerable research activity in this area over the last few years, and some frameworks have been developed for multi‐source statistics. Useful as these frameworks are, they generally do not give guidelines as to which method could be applied in a given situation arising in practice. In this paper, we aim to fill that gap, structure the world of multi‐source statistics and its problems, and provide some guidance on suitable methods for these problems.
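    The basic combination the abstract describes, enriching survey responses with already available administrative data via a shared unit identifier, can be sketched as follows. The identifiers and variables here are invented for illustration only.

```python
# Toy illustration of linking a survey to an administrative register on a
# shared unit id, so survey variables are enriched with admin variables
# without collecting them again from respondents.
survey = {"u1": {"employed": True}, "u2": {"employed": False}}
admin  = {"u1": {"income": 30000}, "u3": {"income": 45000}}

# Keep only units present in both sources; merge their variables.
linked = {
    uid: {**survey[uid], **admin[uid]}
    for uid in survey.keys() & admin.keys()
}
```

    In practice the hard problems start exactly where this sketch stops: units missing from one source, conflicting values between sources, and identifiers that match only approximately, which is why the paper maps situations to suitable methods rather than prescribing a single recipe.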

    Dynamic Similarity-Aware Inverted Indexing for Real-Time Entity Resolution

    Abstract. Entity resolution is the process of identifying groups of records in a single or multiple data sources that represent the same real-world entity. It is an important tool in data de-duplication, in linking records across databases, and in matching query records against a database of existing entities. Most existing entity resolution techniques complete the resolution process offline and on static databases. However, real-world databases are often dynamic, and increasingly organizations need to resolve entities in real-time. Thus, there is a need for new techniques that facilitate working with dynamic databases in real-time. In this paper, we propose a dynamic similarity-aware inverted indexing technique (DySimII) that meets these requirements. We also propose a frequency-filtered indexing technique where only the most frequent attribute values are indexed. We experimentally evaluate our techniques on a large real-world voter database. The results show that when the index size grows no appreciable increase is found in the average record insertion time (around 0.1 msec) and in the average query time (less than 0.1 sec). We also find that applying the frequency-filtered approach reduces the index size with only a slight drop in recall.
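    The core data structure can be illustrated with a toy inverted index that supports incremental insertion and querying. This is a simplified sketch, not DySimII itself: where DySimII precomputes similarities between encoded attribute values, this version crudely stands in for similarity by also indexing a short prefix of each value, so near-matches can share index entries. All names and records are hypothetical.

```python
from collections import defaultdict

class SimpleInvertedIndex:
    """Toy dynamic inverted index for real-time candidate generation.
    Each record is indexed under its exact attribute values and under a
    3-character prefix key (a crude stand-in for similarity encoding)."""

    def __init__(self):
        self.index = defaultdict(set)   # index key -> set of record ids
        self.records = {}               # record id -> record

    def _keys(self, value):
        v = str(value).lower()
        return {v, v[:3]}               # exact key plus cheap "similar" key

    def insert(self, rec_id, record):
        """Add a record incrementally; no rebuild of the index is needed."""
        self.records[rec_id] = record
        for value in record.values():
            for k in self._keys(value):
                self.index[k].add(rec_id)

    def query(self, record):
        """Return ids of all indexed records sharing at least one key."""
        candidates = set()
        for value in record.values():
            for k in self._keys(value):
                candidates |= self.index[k]
        return candidates

idx = SimpleInvertedIndex()
idx.insert("r1", {"name": "smith", "zip": "2600"})
idx.insert("r2", {"name": "jones", "zip": "2601"})
```

    A query for {"name": "smithe"} finds record r1 through the shared prefix key "smi" even though the exact value differs. The frequency-filtered variant in the paper would additionally drop rarely occurring keys from the index to bound its size.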